-
Recent advances in virtualization technologies used in cloud computing offer performance that closely approaches bare-metal levels. Combined with specialized instance types and high-speed networking services for cluster computing, cloud platforms have become a compelling option for high-performance computing (HPC). However, most current batch job schedulers in HPC systems are designed for homogeneous clusters and make decisions based on limited information about jobs and system status. Scientists typically submit computational jobs to these schedulers with a requested runtime that is often over- or under-estimated. More accurate runtime predictions can help schedulers make better decisions and reduce job turnaround times; they can also inform decisions about migrating jobs to the cloud to avoid long queue wait times on HPC systems. In this study, we design neural network models to predict the runtime and resource utilization of jobs on integrated cloud and HPC systems. We develop two monitoring strategies to collect job and system resource utilization data, using a workload management system and a cloud monitoring service, and we evaluate our models on two Department of Energy (DOE) HPC systems and Amazon Web Services (AWS). Our results show that we can predict the runtime of a job with 31–41% mean absolute percentage error (MAPE), 14–17 seconds mean absolute error (MAE), and a 0.99 R-squared (R²) score. An MAE of less than a minute corresponds to 100% accuracy, since the requested time for batch jobs is always specified in hours and/or minutes.
Free, publicly-accessible full text available March 1, 2027
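Since the abstract reports MAE directly in seconds, a regressor trained against an L1 objective optimizes exactly the reported metric. Below is a minimal PyTorch sketch of such a runtime-prediction network; it is not the authors' architecture, and the feature count, layer widths, and optimizer settings are assumptions for illustration.

```python
# Minimal sketch of a job-runtime regressor (hypothetical architecture,
# not the model from the paper). Inputs are assumed to be numeric job
# and system features, e.g. requested time, node count, CPU/memory use.
import torch
import torch.nn as nn

class RuntimePredictor(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),     nn.ReLU(),
            nn.Linear(hidden, 1),          # predicted runtime in seconds
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

model = RuntimePredictor(n_features=8)            # feature count: assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                             # L1 loss == MAE in seconds

def train_step(features: torch.Tensor, runtimes: torch.Tensor) -> float:
    """One optimization step against the MAE objective."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), runtimes)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training against L1 loss matches the reported MAE; MAPE could be targeted instead by weighting each error by the inverse of the true runtime.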
-
Significant obstacles exist in scientific domains, including genetics, climate modeling, and astronomy, around managing, preprocessing, and training deep learning models on complex data. Although several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning frameworks, and data frameworks on high-performance computing (HPC) platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the use of communication libraries such as MPI, GLOO, and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by using Radical Pilot as a task execution framework. Running an end-to-end pipeline of preprocessing, model training, and postprocessing with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system reduces execution time by 3.28 and 75.9 seconds, respectively. The design of Deep RC ensures smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, exhibiting strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this method closes the gap between data preprocessing, model training, and postprocessing.
Free, publicly-accessible full text available June 7, 2026
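As a rough illustration of the task-execution layer described above, the sketch below submits two pipeline stages (preprocessing, then training) as RADICAL-Pilot tasks. The resource label, script names, core counts, and the use of 'ranks' for MPI sizing are placeholders and assumptions, not Deep RC's actual configuration.

```python
# Minimal RADICAL-Pilot sketch (assumptions: resource label and script
# names are placeholders; Deep RC's real pipeline wiring is more involved).
import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # Acquire a pilot on the target machine (placeholder resource label).
    pilot = pmgr.submit_pilots(rp.PilotDescription({
        'resource': 'local.localhost',
        'cores'   : 8,
        'runtime' : 60,   # minutes
    }))
    tmgr.add_pilots(pilot)

    # Stage 1: data preprocessing as a serial task.
    preprocess = rp.TaskDescription({
        'executable': 'python3',
        'arguments' : ['preprocess.py'],    # hypothetical script
    })
    tmgr.submit_tasks(preprocess)
    tmgr.wait_tasks()

    # Stage 2: distributed training as a multi-rank (MPI) task.
    train = rp.TaskDescription({
        'executable': 'python3',
        'arguments' : ['train_model.py'],   # hypothetical script
        'ranks'     : 4,  # MPI ranks; attribute name per recent RP releases
    })
    tmgr.submit_tasks(train)
    tmgr.wait_tasks()
finally:
    session.close()
```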
-
Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields such as genomics, climate modeling, and astronomy. Large-scale solutions such as Google Pathways provide a distributed execution environment for deep learning models, but they are proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth, unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including clouds and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can exploit heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework that executes Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon's design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead, and achieves 4–15% faster execution time than batch execution while performing similar join and sort operations on 35 million and 3.5 billion rows with the same resources. The approach aims to excel on both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
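To make the data-engineering side concrete, here is a minimal sketch of the kind of distributed join the evaluation benchmarks, written against pycylon's DataFrame API over MPI. The toy data and join column are assumptions, and under Radical-Cylon the same work would be launched as a Radical Pilot task rather than via mpirun directly.

```python
# Minimal distributed Cylon join sketch (toy data; launch with e.g.
#   mpirun -n 4 python join_example.py
# under Radical-Cylon this would instead run as a Radical Pilot task).
from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

# One CylonEnv per MPI rank; MPIConfig binds the MPI communicator.
env = CylonEnv(config=MPIConfig(), distributed=True)

left  = DataFrame([[1, 2, 3], [10, 20, 30]])   # inner lists are columns
right = DataFrame([[1, 2, 4], [40, 50, 60]])

# Distributed join on column 0 across all ranks.
joined = left.merge(right=right, on=[0], env=env)
print(f'rank {env.rank}:\n{joined}')

env.finalize()
```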